7 research outputs found

    Investigating performance portability of a highly scalable particle-in-cell simulation code on various multi-core architectures

    The alpaka library defines and implements an abstract hierarchical redundant parallelism model. This model exploits parallelism and memory hierarchies on a node at all levels available in current hardware, which allows performant codes to be portable across various types of accelerators: unsupported levels are simply ignored, and only the levels supported on a specific accelerator are utilized. All hardware types (multi- and many-core CPUs, GPUs, and other accelerators) are treated uniformly and can be programmed in the same way. The C++ template interface provided allows for straightforward extension of the library to support other accelerators and for specialization of its internals for optimization.
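    As a rough illustration of this single-source idea, the sketch below shows one kernel functor written once and executed through two interchangeable "accelerator" policies, a serial CPU backend and a std::thread backend. This is a hypothetical plain-C++ sketch of the pattern only: the names CpuSerialAcc, CpuThreadsAcc, and AxpyKernel are invented for illustration and do not reproduce alpaka's actual API or its grid/block/thread/element hierarchy.

```cpp
// Hypothetical sketch of the single-source kernel pattern described above.
// The accelerator types are invented stand-ins, not alpaka back-ends.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

// A serial "accelerator": the parallel hierarchy collapses to a plain loop.
struct CpuSerialAcc
{
    template<typename Kernel, typename... Args>
    static void run(std::size_t n, Kernel kernel, Args... args)
    {
        for(std::size_t i = 0; i < n; ++i)
            kernel(i, args...);
    }
};

// A threaded "accelerator": the same kernel, spread over hardware threads.
struct CpuThreadsAcc
{
    template<typename Kernel, typename... Args>
    static void run(std::size_t n, Kernel kernel, Args... args)
    {
        unsigned const nThreads = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> pool;
        for(unsigned t = 0; t < nThreads; ++t)
            pool.emplace_back([=] {
                // Each thread handles a disjoint strided index range.
                for(std::size_t i = t; i < n; i += nThreads)
                    kernel(i, args...);
            });
        for(auto& th : pool)
            th.join();
    }
};

// The kernel is written once and knows nothing about the back-end.
struct AxpyKernel
{
    void operator()(std::size_t i, float a, float const* x, float* y) const
    {
        y[i] += a * x[i];
    }
};

int main()
{
    std::vector<float> x(1024, 1.0f), y(1024, 2.0f);
    CpuSerialAcc::run(x.size(), AxpyKernel{}, 2.0f, x.data(), y.data());
    CpuThreadsAcc::run(x.size(), AxpyKernel{}, 2.0f, x.data(), y.data());
    std::cout << y[0] << '\n';  // 2 + 2*1 + 2*1 = 6
}
```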

    Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library

    We present an analysis of optimizing the performance of a single C++11 source code using the Alpaka hardware abstraction library. We use the general matrix multiplication (GEMM) algorithm to show that compilers can optimize Alpaka code effectively when key parameters of the algorithm are tuned. We do not intend to rival existing, highly optimized DGEMM implementations, but merely choose this example to demonstrate that Alpaka allows for platform-specific tuning with a single source code. In addition, we analyze the optimization potential available with vendor-specific compilers when confronted with the heavily templated abstractions of Alpaka. We specifically test the code on bleeding-edge architectures such as Nvidia's Tesla P100, Intel's Knights Landing (KNL) and Haswell architectures, as well as IBM's Power8 system. On some of these we are able to reach almost 50% of the peak floating-point performance using the aforementioned means. When adding compiler-specific #pragmas we are able to reach 5 TFLOP/s on a P100 and over 1 TFLOP/s on a KNL system. (Accepted paper for the P^3MA workshop at ISC 2017 in Frankfurt.)
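    The tuning knob in question is of the kind sketched below: a tile (block) size baked in as a compile-time constant, so each platform can be rebuilt with its own optimum without touching the kernel body. This is a hypothetical plain-C++ illustration; gemm_tiled and TILE are invented names, not the paper's Alpaka-based matmul code.

```cpp
// Hypothetical illustration of a compile-time GEMM tuning parameter.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

template<std::size_t TILE>
void gemm_tiled(std::size_t n, std::vector<double> const& A,
                std::vector<double> const& B, std::vector<double>& C)
{
    // C += A * B for square n x n matrices in row-major layout,
    // processed in TILE x TILE blocks to improve cache reuse.
    for(std::size_t ii = 0; ii < n; ii += TILE)
        for(std::size_t kk = 0; kk < n; kk += TILE)
            for(std::size_t jj = 0; jj < n; jj += TILE)
                for(std::size_t i = ii; i < std::min(ii + TILE, n); ++i)
                    for(std::size_t k = kk; k < std::min(kk + TILE, n); ++k)
                    {
                        double const a = A[i * n + k];
                        for(std::size_t j = jj; j < std::min(jj + TILE, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

int main()
{
    std::size_t const n = 256;
    std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);
    gemm_tiled<64>(n, A, B, C);   // the tile size is the per-platform tuning knob
    std::cout << C[0] << '\n';    // 256, since every entry of A and B is 1
}
```

    Rebuilding with a different gemm_tiled<TILE> instantiation per target is analogous to the per-architecture parameter tuning the paper performs from a single implementation source.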

    Matrix multiplication software and results bundle for paper "Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library" for P^3MA submission

    This is the archive containing the matrix multiplication software and the results of the publication "Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library" submitted to the P^3MA workshop 2017.

    The archive has the following content:

    - Source code for the (tiled) matrix multiplication in "src":
      - Regular version in "src/matmul":
        - Remote: https://github.com/theZiz/matmul.git (copy will be removed)
        - Branch: topic-compatible-alpaka-0-1-0
        - Commit: a63ba4810d6bfcca62c68dd57408af15028e78a3
      - Forked version for XL in "src/matmul":
        - Remote: https://github.com/theZiz/matmul.git (copy will be removed)
        - Branch: topic-xl-workaround
        - Commit: 1fee028eccb8cf7b677e8071233e08aa9f81846a
    - The compiled binaries and the results of the tuning and scaling runs are in "runs", in subfolders for each type of run and architecture.

    PIConGPU, Alpaka, and cupla software bundle for IWOPH 2016 submission

    This is the archive containing the software used for evaluations in the publication "Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond" submitted to the International Workshop on OpenPOWER for HPC 2016.

    The archive has the following content:

    - PIConGPU Kelvin-Helmholtz simulation code (picongpu-alpaka/):
      - Remote: https://github.com/psychocoderHPC/picongpu-alpaka.git
      - Branch: topic-scaling
      - Commit: 1f004c8e0514ad1649f3958a6184878af6e75150
    - Alpaka code (alpaka/):
      - Remote: https://github.com/psychocoderHPC/alpaka.git
      - Branch: topic-picongpu-alpaka
      - Commit: 4a6dd35a9aff62e7f500623c3658685f827f73e5
    - Cupla (cupla/):
      - Remote: https://github.com/psychocoderHPC/cupla.git
      - Branch: topic-dualAccelerators
      - Commit: 4660f5fd8e888aa732230946046219f7e5daa1c9

    The simulation was executed for one thousand time steps with the following configuration (a schematic compile-time sketch of such a setup is given after this list):

    - Particle shape of higher order than CIC (we used TSC)
    - Boris pusher
    - Esirkepov current solver (optimized, generalized)
    - Yee field solver
    - Trilinear interpolation in field gathering
    - 16 particles per cell

    Compile flags:

    - CPU g++-4.9.2: -g0 -O3 -m64 -funroll-loops -march=native -ffast-math --param max-unroll-times=512
    - GPU nvcc: --use_fast_math --ftz=false -g0 -O3 -m64
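    PIConGPU fixes such numerical options at compile time, so the kernels can be specialized on them. The following is a minimal, hypothetical C++ sketch of a compile-time configuration of this kind; the enum and struct names (ParticleShape, Pusher, KelvinHelmholtzSetup, etc.) are invented for illustration and do not reproduce PIConGPU's actual .param interface.

```cpp
// Hypothetical sketch only: the names below are invented for illustration
// and do not match the real PIConGPU configuration headers.
#include <cstdint>
#include <iostream>

enum class ParticleShape { CIC, TSC, PQS };
enum class Pusher        { Boris, Vay };
enum class CurrentSolver { Esirkepov, VillaBune };
enum class FieldSolver   { Yee, Lehe };

struct KelvinHelmholtzSetup
{
    static constexpr ParticleShape shape            = ParticleShape::TSC;
    static constexpr Pusher        pusher           = Pusher::Boris;
    static constexpr CurrentSolver currentSolver    = CurrentSolver::Esirkepov;
    static constexpr FieldSolver   fieldSolver      = FieldSolver::Yee;
    static constexpr std::uint32_t particlesPerCell = 16u;
    static constexpr std::uint32_t timeSteps        = 1000u;
};

int main()
{
    // Because these are compile-time constants, changing the setup means
    // recompiling rather than reading a runtime configuration file.
    std::cout << "particles per cell: " << KelvinHelmholtzSetup::particlesPerCell
              << ", time steps: " << KelvinHelmholtzSetup::timeSteps << '\n';
}
```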

    Scalable, Data Driven Plasma Simulations with PIConGPU

    PIConGPU is an open-source, multi-platform particle-in-cell code scaling to the fastest supercomputers in the TOP500 list. We present the architecture, novel developments, and workflows that enable high-precision, fast turn-around computations on Exascale machines. Furthermore, we present our strategies for handling extreme data flows from thousands of GPUs for analysis with in situ processing and open data formats (openPMD). PIConGPU has recently also become natively controllable from a Python Jupyter interface, and we are researching just-in-time kernel generation for C++ with our Cling-CUDA extensions. (Invited minisymposium talk at the Platform for Advanced Scientific Computing Conference (PASC19) at ETH Zurich, Zurich, Switzerland.)

    Talk "Next-Generation Simulations for XFEL-Plasma Interactions with Solid Density Targets with PIConGPU"

    PIConGPU is reportedly the fastest particle-in-cell code in the world with respect to sustained Flop/s. Written in performance-portable, single-source C++, it lets us constantly push the envelope towards Exascale laser-plasma modeling. However, solving previously week-long simulation tasks in a few hours with a speedy framework is only the beginning. This talk presents the architecture and recent additions driving PIConGPU. As we speak, we run on the fastest machines, and the community is approaching a new generation of TOP10 clusters. Within those, many-core computing architectures and severe limitations in available I/O bandwidth demand a fundamental rethinking of established modeling workflows towards in situ processing. We present our ready-to-use open-source solutions and address scientific repeatability, data reduction in I/O, predictability, and new atomic modeling for XFEL pump-probe experiments.